chore: migrate to master, enforce ci #1

Open
georgewhewell wants to merge 105 commits into master from grw/feat/aiter
@georgewhewell
Contributor

No description provided.

georgewhewell and others added 30 commits February 12, 2026 22:37
Collapse the remote execution stack onto the canonical graph request shape, move quote discovery into the rpc crate, and remove the CLI-only discovery split. This also folds in the executor/client fixes needed to make the new path work end-to-end.

Document the deny-by-default serve flow, surface the active policy mode at startup, and add the Nix/docker packaging helpers that make the new deployment shape usable.

- Enable `otel` feature on tonic-iroh-transport
- Register W3C TraceContextPropagator globally
- Wrap server services with TraceContextExtractor
- Wrap client RPC channels with TraceContextInjector
- Make RemoteExecuteDriver generic to support intercepted channels

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
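
The injector/extractor pair in the commit above carries a W3C `traceparent` value across the RPC boundary. As a rough illustration of that round trip only, here is a stdlib-only sketch; `SpanContext`, `inject`, and `extract` are hypothetical names and are not the `TraceContextInjector` / `TraceContextExtractor` types from the PR, and a plain `HashMap` stands in for request metadata.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a span context; not a type from the PR.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SpanContext {
    pub trace_id: u128,
    pub span_id: u64,
    pub sampled: bool,
}

/// True when `s` consists only of lowercase hex digits, as `traceparent` requires.
fn is_lower_hex(s: &str) -> bool {
    s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Client side: write the context into outgoing request metadata
/// as `version-traceid-parentid-flags`.
pub fn inject(ctx: &SpanContext, headers: &mut HashMap<String, String>) {
    let flags: u8 = if ctx.sampled { 1 } else { 0 };
    let value = format!("00-{:032x}-{:016x}-{:02x}", ctx.trace_id, ctx.span_id, flags);
    headers.insert("traceparent".to_string(), value);
}

/// Server side: recover the caller's context, or `None` when the header
/// is absent or malformed.
pub fn extract(headers: &HashMap<String, String>) -> Option<SpanContext> {
    let value = headers.get("traceparent")?;
    let mut parts = value.split('-');
    let (version, trace, span, flags) =
        (parts.next()?, parts.next()?, parts.next()?, parts.next()?);
    // This sketch only accepts version 00, which has exactly four fields.
    if version != "00" || parts.next().is_some() {
        return None;
    }
    if trace.len() != 32 || span.len() != 16 || flags.len() != 2 {
        return None;
    }
    if !(is_lower_hex(trace) && is_lower_hex(span) && is_lower_hex(flags)) {
        return None;
    }
    let trace_id = u128::from_str_radix(trace, 16).ok()?;
    let span_id = u64::from_str_radix(span, 16).ok()?;
    let flags = u8::from_str_radix(flags, 16).ok()?;
    // All-zero ids are invalid in the W3C format.
    if trace_id == 0 || span_id == 0 {
        return None;
    }
    Some(SpanContext { trace_id, span_id, sampled: flags & 1 == 1 })
}
```

Calling `inject` on the client and `extract` on the server should yield the same `SpanContext` on both sides, which is what lets a remote span attach to the caller's trace.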
…s everywhere

OpenTelemetry plumbing is now opt-in via the `otel` feature on `hellas-cli`
(default off). With the feature off, none of opentelemetry / opentelemetry_sdk
/ opentelemetry-otlp / tracing-opentelemetry / reqwest compile, and the
trace-context propagation glue collapses to identity. Plain `tracing::info!`
/ `warn!` / etc. macros stay unconditional — they're no-op-cheap without a
subscriber.

Cfg surface is concentrated, not sprinkled:
- One function pair in `tracing_config.rs` (`install_with_otel`); registry
  composition stays cfg-free. `TracerGuard` newtype hides the
  `Option<SdkTracerProvider>` behind a cfg-gated field so `main.rs` drops the
  `if let Some(provider) = ...` dance around shutdown.
- `execution.rs`: cfg-swap of the `TracedChannel` type alias plus a
  `traced(channel)` helper collapses 8 `InterceptedService::new(channel,
  TraceContextInjector)` sites and avoids spreading cfg across the file.
- `serve/node.rs`: a single `traced_service<S>` helper replaces 5
  `trace_layer.layer(...)` sites.
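
The concentrated cfg surface described above can be pictured roughly as follows. This is a sketch under stated assumptions, not the code in `execution.rs` or `tracing_config.rs`: `Channel` and `Traced` are stand-ins for the real transport and intercepted-service types, `connect` is a hypothetical call site, and a `String` stands in for `SdkTracerProvider`.

```rust
/// Stand-in for the real transport channel type (assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct Channel(pub String);

/// Stand-in for a channel wrapped with a trace-context injector (assumption).
#[cfg(feature = "otel")]
#[derive(Debug, Clone, PartialEq)]
pub struct Traced(pub Channel);

// The alias is the only thing call sites name, so they stay cfg-free.
#[cfg(feature = "otel")]
pub type TracedChannel = Traced;
#[cfg(not(feature = "otel"))]
pub type TracedChannel = Channel;

/// With `otel` on, wrap the channel; with it off, this is the identity.
#[cfg(feature = "otel")]
pub fn traced(channel: Channel) -> TracedChannel {
    Traced(channel)
}
#[cfg(not(feature = "otel"))]
pub fn traced(channel: Channel) -> TracedChannel {
    channel
}

/// Newtype whose only field is cfg-gated, so callers hold and drop it
/// the same way in both configurations.
pub struct TracerGuard {
    #[cfg(feature = "otel")]
    provider: Option<String>, // stand-in for Option<SdkTracerProvider>
}

impl TracerGuard {
    /// A guard with no provider attached.
    pub fn disabled() -> Self {
        TracerGuard {
            #[cfg(feature = "otel")]
            provider: None,
        }
    }
}

impl Drop for TracerGuard {
    fn drop(&mut self) {
        // Shutdown logic lives here instead of in `main`.
        #[cfg(feature = "otel")]
        {
            if let Some(provider) = self.provider.take() {
                drop(provider); // a real guard would flush and shut down here
            }
        }
    }
}

/// Hypothetical call site: no cfg attributes needed.
pub fn connect(addr: &str) -> TracedChannel {
    traced(Channel(addr.to_string()))
}
```

The point of the shape is that only the alias, the helper, and the guard mention the feature; every caller compiles unchanged whether `otel` is on or off.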

iroh's internal `EndpointMetrics` are bridged into the existing
`prometheus-client` registry exposed at `/metrics`. The cli's `otel` feature
also enables `tonic-iroh-transport/metrics`, and `serve` attaches
`endpoint.metrics()` (clone of Arcs into live storage) via
`MetricsBundle::with_iroh`. The HTTP handler emits prometheus-client text
followed by iroh's OpenMetrics text in one well-formed response with a single
`# EOF` terminator. Verified end-to-end: `endpoint_socket_send_ipv4_total`
etc. show up alongside `hellas_*` counters.
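
A minimal sketch of how two text expositions can be joined into one response with a single `# EOF` terminator. It is an illustration of the idea only: `merge_expositions` is a hypothetical helper, not the handler from the PR, and it assumes each input is already valid exposition text that may or may not end in its own `# EOF` line.

```rust
/// Concatenate several metric text expositions into one response body.
/// Any `# EOF` line in the inputs is dropped, and exactly one is appended
/// at the end so the combined body has a single terminator.
pub fn merge_expositions(parts: &[&str]) -> String {
    let mut out = String::new();
    for part in parts {
        for line in part.lines() {
            if line.trim_end() == "# EOF" {
                continue; // skip per-part terminators
            }
            out.push_str(line);
            out.push('\n');
        }
    }
    out.push_str("# EOF\n");
    out
}
```

Passing the prometheus-client text first and the iroh text second keeps the ordering described in the commit message.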

Switch all TLS to rustls so the crate compiles in weird places (wasm,
cross-compile, no system openssl):
- Workspace `tonic-iroh-transport`: drop `["otel", "native-defaults"]`,
  pin to v0.9.2, use granular features `["tls-ring", "portmapper",
  "fast-apple-datapath"]`. v0.9.2 exposes the new passthroughs.
- Workspace `reqwest`: switch to `["rustls", "webpki-roots"]` (was
  `["rustls-native-certs"]`, which lacks an actual TLS provider — the cause
  of the `"invalid URL, scheme is not http"` symptom against
  jaeger.lsd-ag.ch).
- `opentelemetry-otlp`: add `"reqwest-rustls-webpki-roots"` so its internal
  `reqwest 0.12` (separate from our 0.13) gets a TLS provider too.
- `crates/executor/Cargo.toml`: `hf-hub = "0.5"` was secretly pulling
  `native-tls` -> `openssl-sys` via default features. Pin to
  `default-features = false, features = ["ureq"]`, matching `hellas-rpc`.
- Drop `pkgs.openssl` from `nix/default.nix`, `nix/docker.nix`,
  `nix/package.nix`. `ldd target/debug/hellas-cli` is now empty for
  `libssl`/`libcrypto` in both default and `candle,otel` builds.

Dev workflow:
- `rust-analyzer.toml` at workspace root pins RA's feature set to
  `["candle", "otel"]` so type-checking covers gated modules across
  editors. Replaces the abandoned `HELLAS_FEATURES` env var / cargo shim
  approach.
- `nix/default.nix:88`: `hellas-run` wrapper drops
  `--features "${HELLAS_FEATURES:-candle}"` in favor of explicit
  `--features candle`.

Build matrix: all four cli feature combos compile (`{}`, `candle`, `otel`,
`candle,otel`); workspace check + clippy clean; HTTPS OTLP connect now
succeeds (DNS to jaeger.lsd-ag.ch is environmental).

- expose individual check-{fmt,clippy,sort,test} apps for matrix dispatch
- add `cargo test --workspace` check (default features)
- workflow enumerates check-* apps from the flake, runs each on
  `[self-hosted, shared]`; gates with `CI passed` aggregate job
- opt-in `cache.hellas.ai` substituter via flake.nix nixConfig

writeShellApplication strips PATH down to runtimeInputs only, so cargo's
default linker invocation (`cc`) was failing on the runner.

Single source of truth in nix/ci.nix is now an attrset
{ name -> { check, fix? } }. Exposed flat as `.#ci.<system>.commands`
for the GitHub Actions matrix to enumerate. CI runs each command via
`nix develop -c` so the dev shell is the runtime environment — no
more per-check writeShellApplication wrappers with hand-curated input
lists.

Workflow gains a `devshell` warmup job (clean failure surface for
env issues) and drops `--accept-flake-config` (runner daemon already
trusts cache.hellas.ai; flake nixConfig is for downstream users).

`nix run .#fix` previously fell back to running the check command
for entries without a `fix` field — so it ran cargo test (slow) and
cargo outdated (heuristic) during fix mode. Now those are filtered
out entirely; fix only runs entries with an explicit fix variant.
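
The `{ name -> { check, fix? } }` shape and the fix-mode filter can be modelled as below. This is a Rust model of the logic, not the Nix code in `nix/ci.nix`; `CiCommand`, `check_plan`, and `fix_plan` are hypothetical names chosen for illustration.

```rust
use std::collections::BTreeMap;

/// One CI entry: a mandatory check command and an optional fix command.
#[derive(Debug, Clone)]
pub struct CiCommand {
    pub check: String,
    pub fix: Option<String>,
}

/// Check mode runs every entry.
pub fn check_plan(cmds: &BTreeMap<String, CiCommand>) -> Vec<(&str, &str)> {
    cmds.iter()
        .map(|(name, cmd)| (name.as_str(), cmd.check.as_str()))
        .collect()
}

/// Fix mode runs only entries with an explicit `fix`; entries without one
/// are filtered out rather than falling back to their check command.
pub fn fix_plan(cmds: &BTreeMap<String, CiCommand>) -> Vec<(&str, &str)> {
    cmds.iter()
        .filter_map(|(name, cmd)| cmd.fix.as_deref().map(|fix| (name.as_str(), fix)))
        .collect()
}
```

Under this model an entry such as a test run, which has no `fix`, appears in `check_plan` but never in `fix_plan`, which matches the behaviour change the commit describes.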
Mechanical changes from `nix run .#fix`: rustfmt across the workspace,
cargo-sort across all Cargo.toml files, and clippy's --fix for the
auto-resolvable lints (collapsible if/let chains).

EnqueueError / StartExecutionError now wrap ExecuteJob in Box (was
~232 bytes inline). PreparedRoute / OpaquePreparedRoute box the
RemoteDirect variant which held a ~1KB RemoteExecution. The remaining
variant-size disparity in the route enums is annotated with
`#[allow(clippy::large_enum_variant)]` — the variants are heterogeneous
by nature and the enum lives only briefly during execution setup.
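
To see why boxing helps, here is a small sketch with stand-in types. The type names echo the commit message, but the field layouts, the `Local` variant, the single-field `EnqueueError`, and the `enqueue` function are assumptions made for illustration; only the sizes (about 232 bytes and about 1KB) come from the text above.

```rust
use std::mem::size_of;

/// Stand-in for the ~1KB remote execution state (layout is an assumption).
pub struct RemoteExecution {
    pub payload: [u8; 1024],
}

/// Stand-in for a small local route (hypothetical).
pub struct LocalRoute {
    pub slot: u32,
}

/// Before: the enum is as large as its largest variant.
pub enum InlineRoute {
    Local(LocalRoute),
    RemoteDirect(RemoteExecution),
}

/// After: the large variant sits behind a pointer.
pub enum BoxedRoute {
    Local(LocalRoute),
    RemoteDirect(Box<RemoteExecution>),
}

/// Stand-in for the ~232-byte job (layout is an assumption).
pub struct ExecuteJob {
    pub body: [u8; 232],
}

/// The error hands the rejected job back, boxed so that the `Err` arm
/// stays pointer-sized.
pub struct EnqueueError {
    pub job: Box<ExecuteJob>,
}

/// Hypothetical enqueue that rejects the job when the queue is full.
pub fn enqueue(job: ExecuteJob, queue_full: bool) -> Result<(), EnqueueError> {
    if queue_full {
        Err(EnqueueError { job: Box::new(job) })
    } else {
        Ok(())
    }
}

/// Returns (inline size, boxed size) in bytes for the two route enums.
pub fn route_sizes() -> (usize, usize) {
    (size_of::<InlineRoute>(), size_of::<BoxedRoute>())
}
```

The trade-off is one heap allocation on the boxed path in exchange for a much smaller value on every path that moves or returns the enum.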
After `CI passed`, on push only, build the slow targets in parallel:

  static-x86_64   cross.x86_64-linux-musl.cli
  static-aarch64  cross.aarch64-linux-musl.cli
  docker-cpu      default cpu image
  docker-cuda     alias of docker-cuda12-sm89 (new)

Matrix is driven from `.#ci.<sys>.builds` (same data-driven pattern as
`commands`). `Extended builds passed` is a separate gate from
`CI passed` so branch protection can require them independently.
@georgewhewell georgewhewell force-pushed the grw/feat/aiter branch 7 times, most recently from 120cd35 to 9f36f95 Compare May 11, 2026 15:39